Following Atkins (1991) and Atkins and Levin (1991), I would like to suggest that an adequate computational lexicon can only be established on the basis of top-down design derived from a linguistic theory, in combination with bottom-up information derived from corpora about specific usage of language. Such an approach is supported by evidence that the probabilistic model of Bruce and Wiebe (1995) improves in accuracy when augmented with analytical (theoretically derived) knowledge. The information derived from corpora might include, as suggested by, inter alia, Krovetz (1991), sense frequency information, co-occurrence relations and collocations. It should also include idioms and representations of proper nouns, which establish contexts in which a word can take on non-compositional meanings.
The notion of harnessing linguistically derived insights to aid lexicon design and automatic lexical acquisition has also been convincingly advocated by Light (1996), who shows that surface cues, such as morphological features of a word, can correspond consistently to lexical semantic features associated with that word (or its base form). For example, the prefix un- applied to a verb (e.g. unlatch, unhinge) signals that the verb is a member of the telic aspectual class. Such correspondences, once identified on the basis of theoretical research, can be utilised to establish lexical semantic structures for words through corpus analysis. Light demonstrates the utility of morphological cues for identifying a variety of lexical semantic properties, from aspectual class to general semantic relation (e.g. change-of-state-rel) to antonymy. Corpus analysis driven by correspondences between surface cues and lexical semantic features can clearly play a useful role in automatic lexicon acquisition, but it relies on linguistic observations of those correspondences.
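The un- cue described above can be illustrated with a minimal sketch. It is not Light's procedure: it assumes a lemmatised, part-of-speech-tagged corpus given as (lemma, tag) pairs, a hypothetical tag name "VERB", and a hypothetical set known_verbs of attested base verbs. A verb lemma counts as evidence only if stripping un- leaves a known verb, which filters out accidental matches such as understand.

```python
from collections import Counter

def telic_candidates(tagged_lemmas, known_verbs):
    """Count occurrences of un-prefixed verbs, keyed by their base verb.

    tagged_lemmas: iterable of (lemma, tag) pairs (assumed input format).
    known_verbs:   set of base verb lemmas (hypothetical resource).
    """
    counts = Counter()
    for lemma, tag in tagged_lemmas:
        lemma = lemma.lower()
        if tag != "VERB" or not lemma.startswith("un"):
            continue
        base = lemma[2:]
        # Only accept the cue when the remainder is itself an attested verb.
        if base in known_verbs:
            counts[base] += 1
    return counts

# Toy data, invented purely for illustration.
corpus = [
    ("unlatch", "VERB"), ("the", "DET"), ("gate", "NOUN"),
    ("unhinge", "VERB"), ("under", "ADP"),
    ("unlatch", "VERB"), ("understand", "VERB"),
]
known_verbs = {"latch", "hinge", "understand", "stand"}

candidates = telic_candidates(corpus, known_verbs)
for base, n in sorted(candidates.items()):
    print(f"{base}: {n} occurrence(s) with un-, hypothesised telic")
```

The counts only flag candidates. Whether a flagged base verb is in fact telic remains a matter for the linguistic analysis that licensed the cue in the first place.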
The linguistic analysis of logical metonymy in Chapter 5 resulted in the identification of certain semantic information which would need to be represented in the lexicon in order to model the conventionality of the phenomenon accurately while still capturing a generalisation about how logical metonymy takes place. To acquire the appropriate representation automatically, corpora would need to be analysed for evidence of specific components of qualia structure. This corpus analysis would clearly have to be guided by the linguistic theory underlying the explanatory model, including assumptions about the generative devices encoded in the lexicon (e.g. Pustejovsky 1991, 1995a), since the results of the acquisition depend on a particular view of the processes involved in logical metonymy and a particular view of the kind of lexical structure associated with nouns.
Let us consider how the automatic acquisition of the knowledge relevant to logical metonymy might proceed, given the theoretical analysis in Chapter 5 which assumes that logical metonymy always occurs with respect to either the agentive or telic roles of a noun, but that these roles are not represented in the lexical entry of every noun. Although a certain amount of the work of acquiring qualia structure can apparently proceed via automatic means, some of it still must be built up by hand due to the interpretation required to establish whether or not the telic role should be represented for a particular noun, as will be pointed out in step (22c) below.
What do the needs of the process described above tell us about the framework which must already be in place before this specific corpus analysis can proceed?
We would also like to use corpora to identify the frequency with which a certain word undergoes a potential alternation, as suggested by Copestake and Briscoe (1995). For sense extensions which have no syntactic reflexes, this is a virtually impossible task, even in corpora that have been processed for syntactic structure, because there will be no basis for distinguishing one sense from another in the corpus. However, many sense extensions do have syntactic effects, and a parsed corpus can therefore provide the basis for identifying the frequency of some of the different senses of a word.
The addition of rudimentary semantic tagging to the corpus would also aid in calculating the frequency of various sense extensions, particularly if the lexicon is augmented to include certain selectional restrictions. For example, the verb eat would likely specify that the complement denoting the thing eaten is a foodstuff or something similar. In the context of eat, then, a noun phrase like the lamb (e.g. John ate the lamb) would be interpreted under its meat sense rather than its animal sense. This kind of information could guide the identification of a use of a word with a particular sense.
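The use of selectional restrictions to count sense frequencies can be sketched as follows. Everything in the sketch is an illustrative assumption and not a proposal for the actual lexicon format: the SENSES and SELECTS tables, the semantic type labels, and the input format of (verb, object noun) pairs, which presupposes a corpus parsed at least as far as verb-object relations. A use is assigned a sense only when exactly one sense of the noun satisfies the verb's restriction; otherwise it is left unresolved.

```python
from collections import Counter

# Hypothetical lexicon fragments: noun -> {sense: semantic type},
# and verb -> semantic type required of its object.
SENSES = {"lamb": {"animal": "animal", "meat": "foodstuff"}}
SELECTS = {"eat": "foodstuff", "feed": "animal"}

def resolve_sense(verb, noun):
    """Return the unique sense of noun compatible with verb, else None."""
    required = SELECTS.get(verb)  # None if the verb imposes no restriction here
    matches = [sense for sense, sem_type in SENSES.get(noun, {}).items()
               if sem_type == required]
    return matches[0] if len(matches) == 1 else None

def sense_frequencies(verb_object_pairs, noun):
    """Count resolved senses of noun; unresolved uses are counted under None."""
    return Counter(resolve_sense(v, n) for v, n in verb_object_pairs if n == noun)

# Toy verb-object pairs, invented for illustration.
pairs = [("eat", "lamb"), ("feed", "lamb"), ("eat", "lamb"), ("see", "lamb")]
lamb_counts = sense_frequencies(pairs, "lamb")
print(lamb_counts)
```

Uses such as see the lamb stay unresolved because the verb imposes no relevant restriction, which mirrors the point that frequency can be estimated only for those uses in which some contextual cue distinguishes the senses.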
What does the previous discussion tell us about what the corpus needs to look like in order to support the desired processing? Most corpora in existence have at most part-of-speech tagging (e.g. the BNC) resulting from shallow parsing. They can be useful for identifying collocations and general co-occurrence frequencies. However, in order to identify semantic relationships, the corpus must be given more structure. Specifically, I suggest the following desiderata:
In conclusion, the extraction of information useful to advanced NLP tasks from a corpus demands a certain level of linguistic sophistication both from the corpus and from the framework which drives the corpus analysis. This information will ultimately be necessary in order for computational systems to achieve the capability to handle the problems posed by polysemy and the creativity of language use.